
Conversation

@EnricoDeg
Contributor

@EnricoDeg EnricoDeg commented Dec 19, 2025

Proposed changes

Summary:

  • Add support for direct store in epilogue instead of cshuffle (performance improvement for small K problems)
  • Add padding support for wave transfer without transpose
  • Add wave transfer with interleaved layout to support direct store
  • Enable new functionalities on GEMMs
  • Add optional new functionality support for grouped convolution fwd
  • Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

kabrahamAMD
kabrahamAMD previously approved these changes Dec 22, 2025
Contributor

@kabrahamAMD kabrahamAMD left a comment


LGTM, good work

index_t,
index_t)
{
// Notes: padding is currently not supported
Contributor

Update comment (or remove, as it provides no additional information to the error message)

Contributor Author

Fixed

@EnricoDeg EnricoDeg force-pushed the streamhpc/remove_cshuffle branch from 461a40d to 9134cbb on December 23, 2025 at 07:05

// Limitations of the current implementation:
// - no multiAB
// - GemmSpecialization Default
Contributor

It seems like GemmSpec Default is no longer a hard requirement for using WaveTransfer, but is only required when the A and B layouts are not row- and column-major respectively? If so, update the comment?

Contributor Author

Done


// We need to investigate if it makes sense to remove cshuffle for smaller types
// Currently we use direct store for NRepeat equal to 4 or 8. For 16 bit type we use at
// lease buffer store 64 bit for 16 contiguous threads -> 128 bytes in toral (full cache line)
Contributor

typos: lease, toral

Contributor Author

Fixed

@krithalith
Contributor

LGTM, had some small notes and I have a few questions:

  • Has conv fwd performance with wavetransfer been benchmarked?
  • Has wavetransfer / directstore been benchmarked in general?
  • If no to the above two questions, should we make a new issue for this?
  • Why do we only have wavetransfer instances for grouped conv fwd vanilla 2D?
  • Where did the vanilla fwd 2d instances come from?

@krithalith
Contributor

Discussed offline. Basically benchmarks have been performed and wavetransfer did seem to be a decent bit better for 2D. No significant improvement for 3D so no instances added. We want to look into more automated benchmarks later.

krithalith
krithalith previously approved these changes Dec 24, 2025

Copilot AI left a comment


Pull request overview

This pull request adds support for direct store in epilogue and padding support for wave transfer without transpose, enabling performance improvements for small K problems and grouped convolution operations on GFX12 hardware.

Changes:

  • Added direct store epilogue option that bypasses cshuffle for better performance on small K problems
  • Implemented padding support for wave transfer without transpose operations
  • Added wave transfer with interleaved layout to support direct store functionality
  • Enabled new wave transfer instances for grouped convolution forward operations (F16 and BF16)

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated no comments.

Summary per file (File — Description):
device_grouped_conv2d_fwd_wmma_cshufflev3_wave_transfer_nhwgc_gkyxc_nhwgk_f16_instance.cpp New F16 instance definitions for wave transfer grouped convolution
device_grouped_conv2d_fwd_wmma_cshufflev3_wave_transfer_nhwgc_gkyxc_nhwgk_bf16_instance.cpp New BF16 instance definitions for wave transfer grouped convolution
device_grouped_conv_fwd_wmma_cshufflev3_wave_transfer_instance.hpp Template instantiations for wave transfer instances with various tile sizes
gridwise_ab_transfer_wave_tiles_interleave.hpp New interleaved wave tile transfer implementation for direct store support
epilogue_direct_store.hpp New direct store epilogue that bypasses cshuffle for improved performance
gridwise_gemm_wmma_cshuffle_v3_common.hpp Core logic for conditional epilogue type selection and wave transfer applicability checks
gridwise_gemm_wmma_cshuffle_v3.hpp Added IsFusedKernel template parameter to control direct store usage
gridwise_ab_transfer_wave_tiles.hpp Added padding support via PadGridDescriptor helper method
epilogue_cshuffle_v3_wmma_base.hpp Added IsLDSNeeded() method for LDS usage detection
device_grouped_conv_fwd_multiple_abd_wmma_cshuffle_v3.hpp Major refactoring to pass M_K and N_K descriptors, enable runtime descriptor transformation, and force MN padding
device_grouped_gemm_wmma_splitk_cshuffle_v3.hpp Updated to use conditional epilogue type selection
device_batched_gemm_wmma_cshuffle_v3.hpp Updated to use conditional epilogue type selection
device_batched_gemm_wmma_cshuffle_v3_b_scale.hpp Updated to use conditional epilogue type selection
device_gemm_reduce_wmma_cshuffle_v3.hpp Added IsFusedKernel=true to disable direct store for fused kernels
device_gemm_multiple_d_layernorm_wmma_cshuffle_v3.hpp Added IsFusedKernel=true to disable direct store for fused kernels
device_gemm_bias_add_reduce_wmma_cshuffle_v3.hpp Added IsFusedKernel=true to disable direct store for fused kernels
CMakeLists.txt Added new source files to build configuration
grouped_convolution_forward_wmma_cshufflev3.inc Added function declarations for new wave transfer instances
grouped_convolution_forward.hpp Added instance registration calls for new wave transfer operations


bartekxk
bartekxk previously approved these changes Jan 12, 2026